Matplotlib

People are generally poor at making sense of raw numerical data, but can process and interpret visual information at remarkable speed, quite the opposite of computers in fact. As such, you will nearly always want some sort of visual to accompany your analysis. In this exercise, we'll be using Matplotlib, a plotting package in the SciPy ecosystem with a MATLAB-like syntax, to generate many plots.



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Basics Using Generated Data

1 - Generate three arrays of 500 values, $x$, $y_1$, $y_2$ such that $$\{x \mid -2\pi \le x \le 2\pi \}$$ $$y_1 = \sin(x)$$ $$y_2 = \cos(x)$$



In [2]:

    
x = np.linspace(-2*np.pi, 2*np.pi, 500)
y1 = np.sin(x)
y2 = np.cos(x)
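
As a quick sanity check, the sketch below confirms what `np.linspace` returns here: both endpoints are included and the spacing is uniform. The name `step` is introduced only for this illustration.

```python
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 500)

# linspace includes both endpoints, so 500 points span 499 equal intervals
step = 4 * np.pi / 499

print(len(x))                          # number of samples
print(np.isclose(x[0], -2 * np.pi))    # first value is the lower bound
print(np.isclose(x[-1], 2 * np.pi))    # last value is the upper bound
print(np.allclose(np.diff(x), step))   # spacing is constant
```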

2 - Using the default settings, use pyplot to plot $y_1$ and $y_2$ versus $x$, all on the same plot.



In [3]:

    
plt.plot(x, y1)
plt.plot(x, y2);

3 - Generate the same plots, but set the horizontal and vertical limits to be slightly smaller than the default settings. In other words, tighten up the plot a bit.



In [4]:

    
# Switching to explicit plot
fig = plt.figure()
ax = plt.axes()

ax.plot(x, y1)
ax.plot(x, y2)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);
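
The explicit interface used above can also be reached through `plt.subplots()`, which returns the figure and the axes together. The following is a minimal sketch of the same plot written that way; it is an alternative spelling, not a different technique.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 500)

# plt.subplots() creates the figure and a single Axes in one call
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.plot(x, np.cos(x))

# Set both limits through the Axes object
ax.set_xlim(-2 * np.pi, 2 * np.pi)
ax.set_ylim(-1, 1)
```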

4 - Generate the same plots using all settings from above, but now change the color and thickness of each from the defaults. Play around with the values a bit until you are satisfied with how they look.



In [5]:

    
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);

5 - Generate the same plots using all settings from above, but now add some custom tickmarks with labels of your choosing. Which values would make sense given the functions we are using?



In [6]:

    
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
# Raw strings keep the backslashes in the LaTeX labels from being read as escape sequences
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']));
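
A fixed list of labels is matched to the ticks by position, which appears to be why the list above starts with an empty string. As an alternative sketch, the label can be computed from the tick value itself. The helper `pi_label` below is a hypothetical name, not part of Matplotlib, and it assumes ticks at multiples of $\pi/2$.

```python
import math
from fractions import Fraction

def pi_label(value, pos=None):
    """Return a LaTeX label for a tick located at a multiple of pi/2."""
    # Express the tick position as a fraction of pi (e.g. -3/2, 1, 2)
    frac = Fraction(value / math.pi).limit_denominator(8)
    if frac == 0:
        return '$0$'
    sign = '-' if frac < 0 else ''
    num, den = abs(frac.numerator), frac.denominator
    # Omit a leading 1 so that 1*pi is written simply as pi
    num_text = '' if num == 1 else str(num)
    den_text = '' if den == 1 else '/{}'.format(den)
    return r'${}{}\pi{}$'.format(sign, num_text, den_text)

# Labels for the ticks from -2*pi to 2*pi in steps of pi/2
labels = [pi_label(k * math.pi / 2) for k in range(-4, 5)]
print(labels)
```

In the plot, this function could be passed to `plt.FuncFormatter` in place of the fixed list: `ax.xaxis.set_major_formatter(plt.FuncFormatter(pi_label))`.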

6 - Generate the same plots using all the settings from above, but now change your plot spines so that they are centered at the origin. In other words, change the plot area from a "box" to a "cross".



In [7]:

    
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5)
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None');
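
Positioning the spines at `'center'` puts them in the middle of the axes, which coincides with the origin here only because both axis limits are symmetric around zero. The sketch below assumes deliberately asymmetric limits and pins the spines to the data value 0 instead; `set_visible(False)` is used as an alternative way of hiding the other two spines.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, 3 * np.pi, 500)   # deliberately not symmetric around 0

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(-np.pi, 3 * np.pi)
ax.set_ylim(-1, 2)

# Pin the spines to the data coordinate 0 instead of the middle of the axes
ax.spines['left'].set_position(('data', 0))
ax.spines['bottom'].set_position(('data', 0))

# Hide the two remaining spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
```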

7 - Generate the same plots using all the settings from above, but now add a legend, with labels sine and cosine, to your plot in a position of your choosing.



In [8]:

    
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label='sine')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label='cosine')

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')

# Add the legend in the lower left corner
ax.legend(loc='lower left');

8 - Now generate two more data sets, $$y_3 = \sin(x) + \sin(2x)$$ $$y_4 = \cos(x) + \cos(2x)$$ and add them to your plot, setting different color and line styles (for example, dotted). Be sure to adjust your scales and legend as needed. Also add a title to your plot.



In [9]:

    
y3 = np.sin(x) + np.sin(2*x)
y4 = np.cos(x) + np.cos(2*x)



In [10]:

    
fig = plt.figure(figsize=(8,8))
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label=r'$\sin(x)$')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label=r'$\cos(x)$')

# Add the new functions
ax.plot(x, y3, c='red', linewidth=3, alpha=.5, label=r'$\sin(x) + \sin(2x)$')
ax.plot(x, y4, c='blue', linewidth=3, alpha=.5, label=r'$\cos(x) + \cos(2x)$')

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-3, 3)

# Set locators at multiples of π/2 and the respective labels using a list
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')

# Add the legend in the lower left corner
ax.legend(loc='lower left', frameon=False)

# Set the title
ax.set_title('Some trigonometric functions');
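
The exercise also asks for different line styles, which the cell above leaves at the default. A minimal sketch of how that could look, using the `linestyle` argument of `plot`; the particular choice of dotted and dashed lines is only an example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 500)

fig, ax = plt.subplots(figsize=(8, 8))

# Solid lines (the default) for the two basic functions
ax.plot(x, np.sin(x), c='green', label=r'$\sin(x)$')
ax.plot(x, np.cos(x), c='cyan', label=r'$\cos(x)$')

# Dotted and dashed lines for the two sums
dotted, = ax.plot(x, np.sin(x) + np.sin(2 * x), c='red', linestyle=':',
                  label=r'$\sin(x) + \sin(2x)$')
dashed, = ax.plot(x, np.cos(x) + np.cos(2 * x), c='blue', linestyle='--',
                  label=r'$\cos(x) + \cos(2x)$')

ax.legend(loc='lower left', frameon=False)
```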

More Plots With Real Data

In this exercise we'll be using a real data set to test out the functionality of matplotlib.

1 - Go to the R Data Repository and download, or load directly, the Aircraft Crash data, load it into a DataFrame, and print the first few rows.



In [11]:

    
crash = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/gamclass/airAccs.csv')



In [12]:

    
# Rename the first (unnamed) column to 'id'
crash = crash.rename(columns={crash.columns[0]: 'id'})

crash.head()

    Out[12]:

   id  Date        location                            operator                planeType               Dead  Aboard  Ground
0   1  1908-09-17  Fort Myer, Virginia                 Military - U.S. Army    Wright Flyer III         1.0     2.0     0.0
1   2  1912-07-12  Atlantic City, New Jersey           Military - U.S. Navy    Dirigible                5.0     5.0     0.0
2   3  1913-08-06  Victoria, British Columbia, Canada  Private                 Curtiss seaplane         1.0     1.0     0.0
3   4  1913-09-09  Over the North Sea                  Military - German Navy  Zeppelin L-1 (airship)  14.0    20.0     0.0
4   5  1913-10-17  Near Johannisthal, Germany          Military - German Navy  Zeppelin L-2 (airship)  30.0    30.0     0.0

2 - Generate a histogram for the number of deaths, using bin sizes of your choice. Be sure to adjust the axis and to add a title to make your plot aesthetically appealing.

First, let's take a look at the summary statistics of the data:



In [13]:

    
crash.describe()

    Out[13]:

                id         Dead       Aboard       Ground
count  5666.000000  5655.000000  5625.000000  5592.000000
mean   2833.500000    19.811848    27.375822     1.543455
std    1635.777644    32.520087    42.564764    52.301120
min       1.000000     0.000000     0.000000     0.000000
25%    1417.250000     3.000000     5.000000     0.000000
50%    2833.500000     9.000000    13.000000     0.000000
75%    4249.750000    22.000000    30.000000     0.000000
max    5666.000000   583.000000   644.000000  2750.000000

The column containing the number of deaths seems to be skewed to the right, so we expect a plot with some isolated bars to the right:



In [14]:

    
with plt.style.context('seaborn-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # drop nans because they cause an error with the hist command
    n, bins, patches = ax.hist(crash['Dead'].dropna(), bins=50)
    # set x axis limits
    ax.set_xlim(0,600)
    # adjust the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # add a title to the plot and the x axis
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')
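
The style name `'seaborn-white'` is not available in every Matplotlib release; to my knowledge, versions from 3.6 onward list it as `'seaborn-v0_8-white'`. The sketch below picks whichever name the installed version knows. The helper `first_available_style` is a hypothetical name introduced for this example.

```python
import matplotlib.pyplot as plt

def first_available_style(*candidates):
    """Return the first style name that this Matplotlib installation knows."""
    for name in candidates:
        if name in plt.style.available:
            return name
    # 'default' is always accepted by plt.style.context
    return 'default'

# Newer releases ship the seaborn look under a 'seaborn-v0_8-' prefix
style_name = first_available_style('seaborn-v0_8-white', 'seaborn-white')

with plt.style.context(style_name):
    fig, ax = plt.subplots(figsize=(16, 8))
    ax.set_title('Deaths in plane crashes')
```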

Indeed, many of the bars on the right are almost invisible. Let's look at the bins computed by the hist command to see the values behind those bars:



In [15]:

    
print(bins)
print(n)









    



[   0.     11.66   23.32   34.98   46.64   58.3    69.96   81.62   93.28
  104.94  116.6   128.26  139.92  151.58  163.24  174.9   186.56  198.22
  209.88  221.54  233.2   244.86  256.52  268.18  279.84  291.5   303.16
  314.82  326.48  338.14  349.8   361.46  373.12  384.78  396.44  408.1
  419.76  431.42  443.08  454.74  466.4   478.06  489.72  501.38  513.04
  524.7   536.36  548.02  559.68  571.34  583.  ]
[  3.20400000e+03   1.12900000e+03   4.83000000e+02   2.69000000e+02
   1.50000000e+02   8.10000000e+01   7.80000000e+01   5.70000000e+01
   3.70000000e+01   3.80000000e+01   3.20000000e+01   1.50000000e+01
   1.30000000e+01   2.10000000e+01   8.00000000e+00   8.00000000e+00
   5.00000000e+00   1.00000000e+00   2.00000000e+00   6.00000000e+00
   2.00000000e+00   1.00000000e+00   5.00000000e+00   3.00000000e+00
   1.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00]

As you can see, about two thirds of the bins (roughly the ones above 185 deaths) contain only a handful of crashes, and most of the bins above 300 deaths are empty. Combined with the much larger counts in the first bins, this makes the last bars invisible.
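
To put numbers on this, the printed counts can be summarized directly. The sketch below copies the values from the output of `print(n)` above; the names `total` and `sparse_bins` are introduced only for this illustration.

```python
# Bin counts copied from the output of print(n) above
counts = [3204, 1129, 483, 269, 150, 81, 78, 57, 37, 38, 32, 15, 13, 21, 8, 8,
          5, 1, 2, 6, 2, 1, 5, 3, 1, 1, 0, 0, 1, 2, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

total = sum(counts)
sparse_bins = sum(1 for c in counts if c <= 1)

print(total)                               # crashes with a known death count
print(round(100 * counts[0] / total, 1))   # share of crashes in the first bin, in percent
print(sparse_bins, 'of', len(counts))      # bins holding at most one crash
```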

We can try to solve this in a few ways. First, let's break the y axis and use a different scale for the upper part of the tallest bars:



In [16]:

    
with plt.style.context('seaborn-white'):
    # create two subplots, one for the higher bars and the other for the lower parts of every bar
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    # plot the lower part of bars by setting the limit of the y axis to 90
    ax[1].hist(crash['Dead'].dropna(), bins=50)
    ax[1].set_xlim(0,600)
    ax[1].set_ylim(0,90)
    ax[1].xaxis.set_major_locator(plt.MaxNLocator(50))
    # plot the higher part of bars by setting the limit of the y axis from 100 to 3500
    ax[0].hist(crash['Dead'].dropna(), bins=50)
    ax[0].set_ylim(100,3500)
    # add title and x axis label
    ax[0].set_title('Deaths in plane crashes')
    ax[1].set_xlabel('Number of deaths')
    # delete the spines between the plots
    ax[0].spines['bottom'].set_color('None')
    ax[1].spines['top'].set_color('None')
    # add dashes to indicate the cut in the y axis (shamelessly copying code from stackoverflow!)
    d = .01
    kwargs = dict(transform=ax[0].transAxes, color='k', clip_on=False)
    ax[0].plot((-d,+d),(-d,+d), **kwargs)
    ax[0].plot((1-d,1+d),(-d,+d), **kwargs)
    kwargs.update(transform=ax[1].transAxes)
    ax[1].plot((-d,+d),(1-d,1+d), **kwargs)
    ax[1].plot((1-d,1+d),(1-d,1+d), **kwargs)

Another possible solution is to clip the values and use the last bin to represent all the values over a certain number of deaths:



In [17]:

    
# Function for formatting the label of the clipped bin
def hist_formatter(value, pos):
    if value == 100:
        return ''
    elif value == 98:
        return str(int(value)) + '+'
    else:
        return str(int(value))



In [18]:

    
with plt.style.context('seaborn-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # clip the values over 100 and create 50 bins
    ax.hist(np.clip(crash['Dead'].dropna(), 0, 100), bins=50);
    # set x axis limit to 100
    ax.set_xlim(0,100)
    # add 50 ticks (start and end of the bar)
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # format ticks so that the last bar is labelled 98+
    ax.xaxis.set_major_formatter(plt.FuncFormatter(hist_formatter))
    # add title and x axis label
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')
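
The effect of the clipping is easier to see on a handful of values. The sketch below uses made-up death counts (they are not taken from the data set) and `np.histogram`, which computes the same kind of bins as `hist` without drawing anything.

```python
import numpy as np

# Hypothetical death counts, chosen only to illustrate the clipping
values = np.array([3, 42, 99, 150, 583])

# Everything above 100 is replaced by 100
clipped = np.clip(values, 0, 100)

# 50 bins of width 2 between 0 and 100; the last bin covers 98 to 100
counts, edges = np.histogram(clipped, bins=50, range=(0, 100))

print(clipped)       # the two largest values are now 100
print(edges[-2:])    # edges of the last bin
print(counts[-1])    # 99 and the two clipped values fall in the last bin
```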

3 - Make some plots of total number of deaths with respect to time, making use of Pandas time series functionality. Again, be sure to make your plot aesthetically appealing.

First, let's convert the Date column to datetime format and set it as the index of the DataFrame:



In [20]:

    
# pd.to_datetime parses the whole column at once (pd.datetime is no longer available in recent pandas)
crash['Date'] = pd.to_datetime(crash['Date'], format='%Y-%m-%d')
crash.set_index(['Date'], inplace=True)

Now let's create a new DataFrame containing yearly aggregates of the data:



In [21]:

    
# Resample at the year start ('YS'), summing the numeric columns and using 0 where the sum is NaN
yearly_crash = crash.resample('YS').sum(numeric_only=True).fillna(0)
yearly_crash.head()
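
The two steps above can be followed on a tiny example. The sketch below uses three rows taken from the crash data and shows that resampling at the year start produces one row per calendar year, including years without any crash.

```python
import pandas as pd

# A few rows taken from the crash data, just enough to show the mechanics
sample = pd.DataFrame({
    'Date': ['1985-08-12', '1985-06-23', '1996-11-12'],
    'Dead': [520.0, 329.0, 349.0],
})

# Parse the strings into timestamps and use them as the index
sample['Date'] = pd.to_datetime(sample['Date'], format='%Y-%m-%d')
sample = sample.set_index('Date').sort_index()

# One row per calendar year, labelled with the first day of that year
yearly = sample.resample('YS').sum()

print(yearly.loc[pd.Timestamp('1985-01-01'), 'Dead'])   # total for 1985
print(yearly.loc[pd.Timestamp('1990-01-01'), 'Dead'])   # a year without rows sums to 0
print(len(yearly))                                      # 1985 through 1996
```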

Last but not least, the plot:



In [22]:

    
with plt.style.context('seaborn'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20));

There are two peaks of over 2500 deaths in a single year. Let's determine which years they are:



In [23]:

    
yearly_crash[yearly_crash['Dead'] > 2500]

The years are 1972 and 1985. Let's take a closer look at 1985, and also at 1996, another local peak:



In [24]:

    
crash.loc['1985'].describe()

    Out[24]:

                id        Dead      Aboard     Ground
count    74.000000   74.000000   74.000000  74.000000
mean   3757.500000   36.081081   47.027027   0.013514
std      21.505813   80.429627   87.298727   0.116248
min    3721.000000    0.000000    1.000000   0.000000
25%    3739.250000    3.000000    4.000000   0.000000
50%    3757.500000    8.000000   11.000000   0.000000
75%    3775.750000   28.250000   44.750000   0.000000
max    3794.000000  520.000000  524.000000   1.000000



In [25]:

    
crash.loc['1985'].sort_values('Dead', ascending=False).head(10)

    Out[25]:

Date          id  location                                   operator                     planeType                           Dead  Aboard  Ground
1985-08-12  3765  Mt. Osutaka, near Ueno Village, Japan      Japan Air Lines              Boeing B-747-SR46                  520.0   524.0     0.0
1985-06-23  3757  Atlantic Ocean, 110 miles West of Ireland  Air India                    Boeing B-747-237B                  329.0   329.0     0.0
1985-12-12  3792  Gander, Newfoundland, Canada               Arrow Airways                McDonnell Douglas DC-8 Super 63PF  256.0   256.0     0.0
1985-07-10  3760  Near Uchuduk, Uzbekistan, USSR             Aeroflot                     Tupolev TU-154B-2                  200.0   200.0     0.0
1985-02-19  3736  Near Durango, Vizcaya, Spain               Iberia Airlines              Boeing B-727-256                   148.0   148.0     0.0
1985-08-02  3763  Ft. Worth-Dallas, Texas                    Delta Air Lines              Lockheed L-1011-1 TriStar          134.0   163.0     1.0
1985-07-24  3762  Leticia, Colombia                          Fuerza Aérea Colombiana      Douglas DC-6B                       80.0    80.0     0.0
1985-05-03  3749  Near L'vov, Ukraine, USSR                  Aeroflot / Soviet Air Force  Tupolev TU-134A / Antonov An-26     76.0    76.0     0.0
1985-01-21  3726  Reno, Nevada                               Galaxy Airlines              Lockheed L-188A Electra             70.0    71.0     0.0
1985-11-24  3790  Luqa, Malta                                EgyptAir                     Boeing B-737-266                    60.0   103.0     0.0



In [26]:

    
crash.loc['1996'].describe()

    Out[26]:

                id        Dead      Aboard      Ground
count    81.000000   81.000000   81.000000   81.000000
mean   4630.000000   29.456790   37.753086    3.530864
std      23.526581   57.069705   64.485760   25.275525
min    4590.000000    1.000000    1.000000    0.000000
25%    4610.000000    3.000000    6.000000    0.000000
50%    4630.000000    8.000000   14.000000    0.000000
75%    4650.000000   20.000000   28.000000    0.000000
max    4670.000000  349.000000  349.000000  225.000000



In [27]:

    
crash.loc['1996'].sort_values('Dead', ascending=False).head(10)

    Out[27]:

Date          id  location                             operator                                      planeType                              Dead  Aboard  Ground
1996-11-12  4655  Near Charkhi Dadri, India            Saudi Arabian Airlines / Kazakhstan Airlines  Boeing B-747-168B / Ilyushin IL-76TD  349.0   349.0     0.0
1996-07-17  4631  Off East Moriches, New York          Trans World Airlines                          Boeing B-747-131                      230.0   230.0     0.0
1996-02-06  4598  Off Puerto Plata, Domincan Republic  Alas Nacionales, leased from Birgen Air       Boeing B-757-225                      189.0   189.0     0.0
1996-11-07  4653  Lagos, Nigeria                       Aviation Development Corporation              Boeing B-727-231                      143.0   143.0     0.0
1996-08-29  4639  Spitsbergen, Norway                  Vnokovo Airlines                              Tupolev TU-154M                       141.0   141.0     0.0
1996-11-23  4660  Off Moroni, Comoros                  Ethiopian Airlines                            Boeing B-767-200ER                    127.0   175.0     0.0
1996-02-29  4605  Arequipa, Peru                       Compania de Aviacion Faucett SA (Peru)        Boeing B-737-222                      123.0   123.0     0.0
1996-05-11  4618  Everglades, Miami, Florida           ValuJet                                       McDonnell Douglas DC-9-32             110.0   110.0     0.0
1996-10-31  4651  Sao Paolo, Brazil                    TAM (Brazil)                                  Fokker 100                             95.0    95.0     3.0
1996-02-26  4603  Near Jabal Awliya, Sudan             Military - Sudanese Air Force                 Lockheed C-130H                        91.0    91.0     0.0

We can add some annotations to the plot to describe them:



In [28]:

    
with plt.style.context('seaborn'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20))
    # Add some annotations on the peaks
    style = dict(size=10, color='black')
    # Timestamps place the labels explicitly on the date axis
    ax.text(pd.Timestamp('1972-01-01'), 3020, '1972: 105 crashes with a mean of 28 deaths per crash', ha='center', **style)
    ax.text(pd.Timestamp('1985-01-01'), 2700, '1985: 520 deaths in Mt. Osutaka crash', ha='center', **style);

4 - We're now going to add in some data from a different source to take a look at the bigger picture in terms of number of passengers flying each year. Head over to the World Bank Webpage and download the .csv version of the data in the link. Clean it up and merge it with your original aircraft accident data above. Call this merged data set data_all.



In [29]:

    
# Load the data and set country name as the index
passengers = pd.read_csv('API_IS.AIR.PSGR_DS2_en_csv_v2.csv', skiprows=4)
passengers.set_index('Country Name',inplace=True)
passengers.head()

    Out[29]:

Country Name  Country Code  Indicator Name                     Indicator Code  1960  1961  1962  1963  1964  1965  1966  ...      2008      2009       2010          2011          2012       2013       2014         2015         2016  Unnamed: 61
Aruba         ABW           Air transport, passengers carried  IS.AIR.PSGR      NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...       NaN       NaN        NaN           NaN           NaN        NaN        NaN          NaN          NaN          NaN
Afghanistan   AFG           Air transport, passengers carried  IS.AIR.PSGR      NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...       NaN       NaN  1999127.0  2.279341e+06  1.737962e+06  2044188.0  2209428.0  1929907.728  1917922.714          NaN
Angola        AGO           Air transport, passengers carried  IS.AIR.PSGR      NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...  283887.0  274869.0  1010194.0  9.877980e+05  1.132424e+06  1321872.0  1409952.0  1244491.000  1261671.262          NaN
Albania       ALB           Air transport, passengers carried  IS.AIR.PSGR      NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...  243691.0  231263.0   768533.0  8.297789e+05  8.143397e+05   865848.0   151632.0          NaN    26633.600          NaN
Andorra       AND           Air transport, passengers carried  IS.AIR.PSGR      NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...       NaN       NaN        NaN           NaN           NaN        NaN        NaN          NaN          NaN          NaN

5 rows × 61 columns

We can ignore the first three columns and the last one. We also have to transpose the dataset and convert the years to datetime values in order to be able to merge the two DataFrames:



In [30]:

    
# Drop the metadata columns and transpose the dataset
passengers = passengers.iloc[:, 3:-1].transpose()

# Convert the years to datetime (January 1st of each year) and use them as the new index
passengers.reset_index(inplace=True)
passengers['index'] = pd.to_datetime(passengers['index'], format='%Y')
passengers.set_index('index', inplace=True)

passengers.head()

    Out[30]:

Country Name  Aruba  Afghanistan  Angola  Albania  Andorra  Arab World  United Arab Emirates  Argentina  Armenia  American Samoa  ...  Virgin Islands (U.S.)  Vietnam  Vanuatu  World  Samoa  Kosovo  Yemen, Rep.  South Africa  Zambia  Zimbabwe
index
1960-01-01      NaN          NaN     NaN      NaN      NaN         NaN                   NaN        NaN      NaN             NaN  ...                    NaN      NaN      NaN    NaN    NaN     NaN          NaN           NaN     NaN       NaN
1961-01-01      NaN          NaN     NaN      NaN      NaN         NaN                   NaN        NaN      NaN             NaN  ...                    NaN      NaN      NaN    NaN    NaN     NaN          NaN           NaN     NaN       NaN
1962-01-01      NaN          NaN     NaN      NaN      NaN         NaN                   NaN        NaN      NaN             NaN  ...                    NaN      NaN      NaN    NaN    NaN     NaN          NaN           NaN     NaN       NaN
1963-01-01      NaN          NaN     NaN      NaN      NaN         NaN                   NaN        NaN      NaN             NaN  ...                    NaN      NaN      NaN    NaN    NaN     NaN          NaN           NaN     NaN       NaN
1964-01-01      NaN          NaN     NaN      NaN      NaN         NaN                   NaN        NaN      NaN             NaN  ...                    NaN      NaN      NaN    NaN    NaN     NaN          NaN           NaN     NaN       NaN

5 rows × 264 columns

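Now we can sum all the columns to get the total number of passengers in a year and merge the two datasets. Note that the World Bank file also contains aggregate entries such as "Arab World" and "World" next to the individual countries, so a sum over every column counts the same passengers more than once; the totals below show the trend over time rather than the true number of passengers: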



In [31]:

    
# Sum all the columns to get one total per year
total_passengers = pd.DataFrame(passengers.fillna(0).sum(axis=1), columns=['Passengers'])

# Merge datasets using the indexes and selecting only some columns
all_data = pd.merge(total_passengers, yearly_crash.iloc[:,1:], left_index=True, right_index=True, how='inner')

all_data

    Out[31]:

              Passengers    Dead  Aboard  Ground
1960-01-01  0.000000e+00  1735.0  2121.0    37.0
1961-01-01  0.000000e+00  1521.0  2089.0    16.0
1962-01-01  0.000000e+00  2123.0  2540.0     5.0
1963-01-01  0.000000e+00  1379.0  1642.0    96.0
1964-01-01  0.000000e+00  1354.0  1717.0     0.0
1965-01-01  0.000000e+00  1824.0  2387.0    23.0
1966-01-01  0.000000e+00  1712.0  2001.0   159.0
1967-01-01  0.000000e+00  1814.0  2358.0    71.0
1968-01-01  0.000000e+00  2284.0  3058.0     5.0
1969-01-01  0.000000e+00  2101.0  2722.0   117.0
1970-01-01  1.866388e+09  2118.0  2790.0     9.0
1971-01-01  1.996079e+09  1951.0  2587.0     0.0
1972-01-01  1.835069e+09  2946.0  3644.0    55.0
1973-01-01  2.407371e+09  2477.0  3481.0    20.0
1974-01-01  2.886075e+09  2387.0  3123.0     0.0
1975-01-01  3.001390e+09  1659.0  2377.0    14.0
1976-01-01  3.292558e+09  2045.0  2717.0   144.0
1977-01-01  3.583457e+09  2173.0  2722.0     9.0
1978-01-01  4.015000e+09  1576.0  2740.0    23.0
1979-01-01  4.518941e+09  1992.0  2489.0    38.0
1980-01-01  4.523483e+09  1829.0  2754.0     1.0
1981-01-01  4.553121e+09  1245.0  1577.0    60.0
1982-01-01  4.660447e+09  1794.0  3163.0    15.0
1983-01-01  4.864865e+09  1612.0  2367.0    31.0
1984-01-01  5.187131e+09  1033.0  1495.0    72.0
1985-01-01  5.526388e+09  2670.0  3480.0     1.0
1986-01-01  5.922796e+09  1471.0  2567.0    51.0
1987-01-01  6.350715e+09  1723.0  2231.0    58.0
1988-01-01  6.711559e+09  2034.0  3037.0    83.0
1989-01-01  6.906911e+09  2283.0  3633.0    79.0
1990-01-01  7.201023e+09  1182.0  2265.0    80.0
1991-01-01  8.442310e+09  1839.0  2459.0     5.0
1992-01-01  8.649662e+09  2121.0  2967.0    61.0
1993-01-01  8.529488e+09  1571.0  2420.0     0.0
1994-01-01  9.182563e+09  1876.0  3108.0    24.0
1995-01-01  9.704725e+09  1593.0  2112.0     0.0
1996-01-01  1.032251e+10  2386.0  3058.0   286.0
1997-01-01  1.079187e+10  1672.0  2485.0    44.0
1998-01-01  1.086682e+10  1544.0  2011.0    48.0
1999-01-01  1.150838e+10   971.0  3003.0    36.0
2000-01-01  1.231302e+10  1429.0  2357.0    23.0
2001-01-01  1.234622e+10  1416.0  2129.0  5641.0
2002-01-01  1.214765e+10  1433.0  1798.0   170.0
2003-01-01  1.250480e+10  1279.0  1510.0    24.0
2004-01-01  1.429957e+10   728.0   937.0     2.0
2005-01-01  1.502382e+10  1317.0  2176.0    59.0
2006-01-01  1.594965e+10  1151.0  1431.0     4.0
2007-01-01  1.715771e+10   931.0  1364.0    57.0
2008-01-01  1.731750e+10   823.0  1466.0    60.0
2009-01-01  1.780159e+10  1095.0  1657.0     3.0
2010-01-01  2.125285e+10  1085.0  1514.0     9.0
2011-01-01  2.279443e+10   772.0  1041.0    27.0
2012-01-01  2.391928e+10   596.0   704.0    52.0
2013-01-01  2.534156e+10   311.0   827.0     5.0
2014-01-01  2.694625e+10   359.0   368.0     0.0
The years between 1960 and 1969 don't seem to have any passenger data, despite being present in the passengers CSV, so we are going to skip them:



In [32]:

    
all_data = all_data.loc['1970-01-01':]

all_data.head()

    Out[32]:

              Passengers    Dead  Aboard  Ground
1970-01-01  1.866388e+09  2118.0  2790.0     9.0
1971-01-01  1.996079e+09  1951.0  2587.0     0.0
1972-01-01  1.835069e+09  2946.0  3644.0    55.0
1973-01-01  2.407371e+09  2477.0  3481.0    20.0
1974-01-01  2.886075e+09  2387.0  3123.0     0.0

5 - Using data_all, create two graphs to visualize how the number of deaths and passengers vary with time, and, as always, make your plots as visually appealing as possible.



In [33]:

    
# Function for formatting the passenger tick labels in millions
def million_formatter(value, pos):
    return str(int(value / 1e6)) + 'M'
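
A formatter like this is an ordinary function of a tick value and a tick position, so it can be tried out without drawing a plot. The sketch below repeats the function so that it runs on its own and adds `billion_formatter`, a hypothetical variant that is not used elsewhere in this notebook.

```python
def million_formatter(value, pos):
    # Truncate to whole millions and append the unit
    return str(int(value / 1e6)) + 'M'

def billion_formatter(value, pos):
    # Hypothetical variant: one decimal place, in billions
    return '{:.1f}B'.format(value / 1e9)

# The first passenger total of the merged table (1970)
print(million_formatter(1.866388e9, None))
print(billion_formatter(1.866388e9, None))
```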



In [34]:

    
with plt.style.context('seaborn'):
    # Make two subplots, the first with number of passengers and the second with number of deaths
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    ax[0].plot(all_data['Passengers'])
    ax[1].plot(all_data['Dead'])
    # Set the limits of the common x axis
    ax[0].set_xlim(all_data.index.values.min(), all_data.index.values.max())
    # Set the titles for each plot
    ax[0].set_title('Number of passengers by year')
    ax[1].set_title('Deaths in plane crashes by year')
    # Format the labels for the number of passengers
    ax[0].yaxis.set_major_formatter(plt.FuncFormatter(million_formatter));

6 - Make a pie chart representing the number of deaths for each decade. Consult the pyplot documentation to play around with the settings a bit.



In [35]:

    
# Add a new column with the decade
all_data['Decade'] = all_data.index.year // 10 * 10









    



C:\Users\alessandro.diantonio\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
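
The warning appears because all_data was created as a slice of another DataFrame (all_data.loc['1970-01-01':]), so pandas cannot tell whether the new column should land in the original frame or in a copy. One common way to avoid it, sketched below on a made-up three-row frame, is to take an explicit .copy() when slicing. The names full and recent are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative frame; values taken from this notebook's outputs
full = pd.DataFrame(
    {'Dead': [2101.0, 2118.0, 2670.0]},
    index=pd.to_datetime(['1969-01-01', '1970-01-01', '1985-01-01']))

# An explicit copy makes `recent` an independent DataFrame,
# so adding a column to it is unambiguous
recent = full.loc['1970-01-01':].copy()

# Integer-divide the year by 10, then multiply back: 1970 -> 1970, 1985 -> 1980
recent['Decade'] = recent.index.year // 10 * 10

print(recent['Decade'].tolist())
print('Decade' in full.columns)
```

The original frame is left untouched, which is usually the intent when a slice is kept for further analysis.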



In [36]:

    
with plt.style.context('ggplot'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Sum of the Dead column grouped by decade (groupby sorts the decades)
    deaths_by_decade = all_data.groupby('Decade')['Dead'].sum()
    ax.pie(deaths_by_decade,
           labels=deaths_by_decade.index, # label each slice with its own decade
           counterclock=False, # change the order of the slices to clockwise
           startangle=90, # start from the top
           shadow=True, # add shadows
           autopct='%1.1f%%', # add percentage inside each slice
           explode=(0.1, 0, 0, 0, 0) # make the first slice pop out a bit
          )
    # Set a title
    ax.set_title('Number of deaths by decade');
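
The explode tuple above is hard-coded for five decades, and ax.pie rejects an explode sequence whose length does not match the number of slices. A minimal sketch of building it from the data instead, using decade totals that are made up for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up decade totals, for illustration only
deaths_by_decade = pd.Series([300.0, 200.0, 100.0], index=[1970, 1980, 1990])

# One offset per slice: pop out the first slice, leave the rest in place
explode = [0.1] + [0] * (len(deaths_by_decade) - 1)

fig, ax = plt.subplots()
ax.pie(deaths_by_decade,
       labels=[str(decade) for decade in deaths_by_decade.index],
       explode=explode,
       autopct='%1.1f%%')

# Each slice is drawn as one wedge patch on the axes
print(len(ax.patches), explode)
```

This way the same cell keeps working if the date range, and therefore the number of decades, changes.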

	id	Date	location	operator	planeType	Dead	Aboard
0	1	1908-09-17	Fort Myer, Virginia	Military - U.S. Army	Wright Flyer III	1.0	2.0
1	2	1912-07-12	Atlantic City, New Jersey	Military - U.S. Navy	Dirigible	5.0	5.0
2	3	1913-08-06	Victoria, British Columbia, Canada	Private	Curtiss seaplane	1.0	1.0
3	4	1913-09-09	Over the North Sea	Military - German Navy	Zeppelin L-1 (airship)	14.0	20.0
4	5	1913-10-17	Near Johannisthal, Germany	Military - German Navy	Zeppelin L-2 (airship)	30.0	30.0

	id	Dead	Aboard	Ground
count	5666.000000	5655.000000	5625.000000	5592.000000
mean	2833.500000	19.811848	27.375822	1.543455
std	1635.777644	32.520087	42.564764	52.301120
min	1.000000	0.000000	0.000000	0.000000
25%	1417.250000	3.000000	5.000000	0.000000
50%	2833.500000	9.000000	13.000000	0.000000
75%	4249.750000	22.000000	30.000000	0.000000
max	5666.000000	583.000000	644.000000	2750.000000

	id	Dead	Aboard	Ground
Date
1908-01-01	1.0	1.0	2.0	0.0
1909-01-01	0.0	0.0	0.0	0.0
1910-01-01	0.0	0.0	0.0	0.0
1911-01-01	0.0	0.0	0.0	0.0
1912-01-01	2.0	5.0	5.0	0.0

	id	Dead	Aboard	Ground
Date
1972-01-01	288015.0	2946.0	3644.0	55.0
1985-01-01	278055.0	2670.0	3480.0	1.0

	id	Dead	Aboard	Ground
count	74.000000	74.000000	74.000000	74.000000
mean	3757.500000	36.081081	47.027027	0.013514
std	21.505813	80.429627	87.298727	0.116248
min	3721.000000	0.000000	1.000000	0.000000
25%	3739.250000	3.000000	4.000000	0.000000
50%	3757.500000	8.000000	11.000000	0.000000
75%	3775.750000	28.250000	44.750000	0.000000
max	3794.000000	520.000000	524.000000	1.000000

	id	location	operator	planeType	Dead	Aboard	Ground
Date
1985-08-12	3765	Mt. Osutaka, near Ueno Village, Japan	Japan Air Lines	Boeing B-747-SR46	520.0	524.0	0.0
1985-06-23	3757	Atlantic Ocean, 110 miles West of Ireland	Air India	Boeing B-747-237B	329.0	329.0	0.0
1985-12-12	3792	Gander, Newfoundland, Canada	Arrow Airways	McDonnell Douglas DC-8 Super 63PF	256.0	256.0	0.0
1985-07-10	3760	Near Uchuduk, Uzbekistan, USSR	Aeroflot	Tupolev TU-154B-2	200.0	200.0	0.0
1985-02-19	3736	Near Durango, Vizcaya, Spain	Iberia Airlines	Boeing B-727-256	148.0	148.0	0.0
1985-08-02	3763	Ft. Worth-Dallas, Texas	Delta Air Lines	Lockheed L-1011-1 TriStar	134.0	163.0	1.0
1985-07-24	3762	Leticia, Colombia	Fuerza Aérea Colombiana	Douglas DC-6B	80.0	80.0	0.0
1985-05-03	3749	Near L'vov, Ukraine, USSR	Aeroflot / Soviet Air Force	Tupolev TU-134A / Antonov An-26	76.0	76.0	0.0
1985-01-21	3726	Reno, Nevada	Galaxy Airlines	Lockheed L-188A Electra	70.0	71.0	0.0
1985-11-24	3790	Luqa, Malta	EgyptAir	Boeing B-737-266	60.0	103.0	0.0

	id	Dead	Aboard	Ground
count	81.000000	81.000000	81.000000	81.000000
mean	4630.000000	29.456790	37.753086	3.530864
std	23.526581	57.069705	64.485760	25.275525
min	4590.000000	1.000000	1.000000	0.000000
25%	4610.000000	3.000000	6.000000	0.000000
50%	4630.000000	8.000000	14.000000	0.000000
75%	4650.000000	20.000000	28.000000	0.000000
max	4670.000000	349.000000	349.000000	225.000000

	id	location	operator	planeType	Dead	Aboard	Ground
Date
1996-11-12	4655	Near Charkhi Dadri, India	Saudi Arabian Airlines / Kazakhstan Airlines	Boeing B-747-168B / Ilyushin IL-76TD	349.0	349.0	0.0
1996-07-17	4631	Off East Moriches, New York	Trans World Airlines	Boeing B-747-131	230.0	230.0	0.0
1996-02-06	4598	Off Puerto Plata, Domincan Republic	Alas Nacionales, leased from Birgen Air	Boeing B-757-225	189.0	189.0	0.0
1996-11-07	4653	Lagos, Nigeria	Aviation Development Corporation	Boeing B-727-231	143.0	143.0	0.0
1996-08-29	4639	Spitsbergen, Norway	Vnokovo Airlines	Tupolev TU-154M	141.0	141.0	0.0
1996-11-23	4660	Off Moroni, Comoros	Ethiopian Airlines	Boeing B-767-200ER	127.0	175.0	0.0
1996-02-29	4605	Arequipa, Peru	Compania de Aviacion Faucett SA (Peru)	Boeing B-737-222	123.0	123.0	0.0
1996-05-11	4618	Everglades, Miami, Florida	ValuJet	McDonnell Douglas DC-9-32	110.0	110.0	0.0
1996-10-31	4651	Sao Paolo, Brazil	TAM (Brazil)	Fokker 100	95.0	95.0	3.0
1996-02-26	4603	Near Jabal Awliya, Sudan	Military - Sudanese Air Force	Lockheed C-130H	91.0	91.0	0.0

	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	1966	...	2008	2009	2010	2011	2012	2013	2014	2015	2016	Unnamed: 61
Country Name
Aruba	ABW	Air transport, passengers carried	IS.AIR.PSGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Afghanistan	AFG	Air transport, passengers carried	IS.AIR.PSGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	1999127.0	2.279341e+06	1.737962e+06	2044188.0	2209428.0	1929907.728	1917922.714	NaN
Angola	AGO	Air transport, passengers carried	IS.AIR.PSGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	283887.0	274869.0	1010194.0	9.877980e+05	1.132424e+06	1321872.0	1409952.0	1244491.000	1261671.262	NaN
Albania	ALB	Air transport, passengers carried	IS.AIR.PSGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	243691.0	231263.0	768533.0	8.297789e+05	8.143397e+05	865848.0	151632.0	NaN	26633.600	NaN
Andorra	AND	Air transport, passengers carried	IS.AIR.PSGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Country Name	Aruba	Afghanistan	Angola	Albania	Andorra	Arab World	United Arab Emirates	Argentina	Armenia	American Samoa	...	Virgin Islands (U.S.)	Vietnam	Vanuatu	World	Samoa	Kosovo	Yemen, Rep.	South Africa	Zambia	Zimbabwe
index
1960-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1961-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1962-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1963-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1964-01-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Passengers	Dead	Aboard	Ground
1960-01-01	0.000000e+00	1735.0	2121.0	37.0
1961-01-01	0.000000e+00	1521.0	2089.0	16.0
1962-01-01	0.000000e+00	2123.0	2540.0	5.0
1963-01-01	0.000000e+00	1379.0	1642.0	96.0
1964-01-01	0.000000e+00	1354.0	1717.0	0.0
1965-01-01	0.000000e+00	1824.0	2387.0	23.0
1966-01-01	0.000000e+00	1712.0	2001.0	159.0
1967-01-01	0.000000e+00	1814.0	2358.0	71.0
1968-01-01	0.000000e+00	2284.0	3058.0	5.0
1969-01-01	0.000000e+00	2101.0	2722.0	117.0
1970-01-01	1.866388e+09	2118.0	2790.0	9.0
1971-01-01	1.996079e+09	1951.0	2587.0	0.0
1972-01-01	1.835069e+09	2946.0	3644.0	55.0
1973-01-01	2.407371e+09	2477.0	3481.0	20.0
1974-01-01	2.886075e+09	2387.0	3123.0	0.0
1975-01-01	3.001390e+09	1659.0	2377.0	14.0
1976-01-01	3.292558e+09	2045.0	2717.0	144.0
1977-01-01	3.583457e+09	2173.0	2722.0	9.0
1978-01-01	4.015000e+09	1576.0	2740.0	23.0
1979-01-01	4.518941e+09	1992.0	2489.0	38.0
1980-01-01	4.523483e+09	1829.0	2754.0	1.0
1981-01-01	4.553121e+09	1245.0	1577.0	60.0
1982-01-01	4.660447e+09	1794.0	3163.0	15.0
1983-01-01	4.864865e+09	1612.0	2367.0	31.0
1984-01-01	5.187131e+09	1033.0	1495.0	72.0
1985-01-01	5.526388e+09	2670.0	3480.0	1.0
1986-01-01	5.922796e+09	1471.0	2567.0	51.0
1987-01-01	6.350715e+09	1723.0	2231.0	58.0
1988-01-01	6.711559e+09	2034.0	3037.0	83.0
1989-01-01	6.906911e+09	2283.0	3633.0	79.0
1990-01-01	7.201023e+09	1182.0	2265.0	80.0
1991-01-01	8.442310e+09	1839.0	2459.0	5.0
1992-01-01	8.649662e+09	2121.0	2967.0	61.0
1993-01-01	8.529488e+09	1571.0	2420.0	0.0
1994-01-01	9.182563e+09	1876.0	3108.0	24.0
1995-01-01	9.704725e+09	1593.0	2112.0	0.0
1996-01-01	1.032251e+10	2386.0	3058.0	286.0
1997-01-01	1.079187e+10	1672.0	2485.0	44.0
1998-01-01	1.086682e+10	1544.0	2011.0	48.0
1999-01-01	1.150838e+10	971.0	3003.0	36.0
2000-01-01	1.231302e+10	1429.0	2357.0	23.0
2001-01-01	1.234622e+10	1416.0	2129.0	5641.0
2002-01-01	1.214765e+10	1433.0	1798.0	170.0
2003-01-01	1.250480e+10	1279.0	1510.0	24.0
2004-01-01	1.429957e+10	728.0	937.0	2.0
2005-01-01	1.502382e+10	1317.0	2176.0	59.0
2006-01-01	1.594965e+10	1151.0	1431.0	4.0
2007-01-01	1.715771e+10	931.0	1364.0	57.0
2008-01-01	1.731750e+10	823.0	1466.0	60.0
2009-01-01	1.780159e+10	1095.0	1657.0	3.0
2010-01-01	2.125285e+10	1085.0	1514.0	9.0
2011-01-01	2.279443e+10	772.0	1041.0	27.0
2012-01-01	2.391928e+10	596.0	704.0	52.0
2013-01-01	2.534156e+10	311.0	827.0	5.0
2014-01-01	2.694625e+10	359.0	368.0	0.0